AITopics | topic classification

Collaborating Authors

topic classification

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

3acb2a202ae4bea8840224e6fce16fd0-Supplemental.pdf

Neural Information Processing SystemsFeb-8-2026, 03:35:31 GMT

arxiv preprint arxiv, information, representation, (15 more...)

Neural Information Processing Systems

Country:

North America > United States > Colorado > Boulder County > Boulder (0.14)
North America > United States > California > Santa Clara County > Stanford (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
(6 more...)

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.69)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.69)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)

Add feedback

Financial Text Classification Based On rLoRA Finetuning On Qwen3-8B model

Lian, Zhiming

arXiv.org Artificial IntelligenceDec-2-2025

Financial text classification has increasingly become an important aspect in quantitative trading systems and related tasks, such as financial sentiment analysis and the classification of financial news. In this paper, we assess the performance of the large language model Qwen3-8B on both tasks. Qwen3-8B is a state-of-the-art model that exhibits strong instruction-following and multilingual capabilities, and is distinct from standard models, primarily because it is specifically optimized for efficient fine tuning and high performance on reasoning-based benchmarks, making it suitable for financial applications. To adapt this model, we apply Noisy Embedding Instruction Finetuning and based on our previous work, this method increases robustness by injecting controlled noise into the embedding layers during supervised adaptation. We improve efficiency further with Rank-stabilized Low-Rank Adaptation low-rank optimization approach, and FlashAttention, which allow for faster training with lower GPU memory. For both tasks, we benchmark Qwen3-8B against standard classical transformer models, such as T5, BERT, and RoBERTa, and large models at scale, such as LLaMA1-7B, LLaMA2-7B, and Baichuan2-7B. The findings reveal that Qwen3-8B consistently surpasses these baselines by obtaining better classification accuracy and needing fewer training epochs. The synergy of instruction-based fine-tuning and memory-efficient optimization methods suggests Qwen3-8B can potentially serve as a scalable, economical option for real-time financial NLP applications. Qwen3-8B provides a very promising base for advancing dynamic quantitative trading systems in the future.

classification, large language model, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2512.0063

Country: North America > United States (0.14)

Genre: Research Report > New Finding (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

SemImage: Semantic Image Representation for Text, a Novel Framework for Embedding Disentangled Linguistic Features

Zare, Mohammad

arXiv.org Artificial IntelligenceDec-2-2025

We propose SemImage, a novel method for representing a text document as a two-dimensional semantic image to be processed by convolutional neural networks (CNNs). In a SemImage, each word is represented as a pixel in a 2D image: rows correspond to sentences and an additional boundary row is inserted between sentences to mark semantic transitions. Each pixel is not a typical RGB value but a vector in a disentangled HSV color space, encoding different linguistic features: the Hue with two components H_cos and H_sin to account for circularity encodes the topic, Saturation encodes the sentiment, and Value encodes intensity or certainty. We enforce this disentanglement via a multi-task learning framework: a ColorMapper network maps each word embedding to the HSV space, and auxiliary supervision is applied to the Hue and Saturation channels to predict topic and sentiment labels, alongside the main task objective. The insertion of dynamically computed boundary rows between sentences yields sharp visual boundaries in the image when consecutive sentences are semantically dissimilar, effectively making paragraph breaks salient. We integrate SemImage with standard 2D CNNs (e.g., ResNet) for document classification. Experiments on multi-label datasets (with both topic and sentiment annotations) and single-label benchmarks demonstrate that SemImage can achieve competitive or better accuracy than strong text classification baselines (including BERT and hierarchical attention networks) while offering enhanced interpretability. An ablation study confirms the importance of the multi-channel HSV representation and the dynamic boundary rows. Finally, we present visualizations of SemImage that qualitatively reveal clear patterns corresponding to topic shifts and sentiment changes in the generated image, suggesting that our representation makes these linguistic features visible to both humans and machines.

machine learning, natural language, text classification, (21 more...)

arXiv.org Artificial Intelligence

2512.00088

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

State of the Art in Text Classification for South Slavic Languages: Fine-Tuning or Prompting?

Pungeršek, Taja Kuzman, Rupnik, Peter, Porupski, Ivan, Dinić, Vuk, Ljubešić, Nikola

arXiv.org Artificial IntelligenceNov-12-2025

Until recently, fine-tuned BERT-like models provided state-of-the-art performance on text classification tasks. With the rise of instruction-tuned decoder-only models, commonly known as large language models (LLMs), the field has increasingly moved toward zero-shot and few-shot prompting. However, the performance of LLMs on text classification, particularly on less-resourced languages, remains under-explored. In this paper, we evaluate the performance of current language models on text classification tasks across several South Slavic languages. We compare openly available fine-tuned BERT-like models with a selection of open-source and closed-source LLMs across three tasks in three domains: sentiment classification in parliamentary speeches, topic classification in news articles and parliamentary speeches, and genre identification in web texts. Our results show that LLMs demonstrate strong zero-shot performance, often matching or surpassing fine-tuned BERT-like models. Moreover, when used in a zero-shot setup, LLMs perform comparably in South Slavic languages and English. However, we also point out key drawbacks of LLMs, including less predictable outputs, significantly slower inference, and higher computational costs. Due to these limitations, fine-tuned BERT-like models remain a more practical choice for large-scale automatic text annotation.

classification, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2511.07989

Country: Europe (1.00)

Genre: Research Report > New Finding (1.00)

Industry: Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Ibom NLP: A Step Toward Inclusive Natural Language Processing for Nigeria's Minority Languages

Kalejaiye, Oluwadara, Beyene, Luel Hagos, Adelani, David Ifeoluwa, Edet, Mmekut-Mfon Gabriel, Akpan, Aniefon Daniel, Urua, Eno-Abasi, Andy, Anietie

arXiv.org Artificial IntelligenceNov-11-2025

Nigeria is the most populous country in Africa with a population of more than 200 million people. More than 500 languages are spoken in Nigeria and it is one of the most linguistically diverse countries in the world. Despite this, natural language processing (NLP) research has mostly focused on the following four languages: Hausa, Igbo, Nigerian-Pidgin, and Yoruba (i.e <1% of the languages spoken in Nigeria). This is in part due to the unavailability of textual data in these languages to train and apply NLP algorithms. In this work, we introduce ibom -- a dataset for machine translation and topic classification in four Coastal Nigerian languages from the Akwa Ibom State region: Anaang, Efik, Ibibio, and Oro. These languages are not represented in Google Translate or in major benchmarks such as Flores-200 or SIB-200. We focus on extending Flores-200 benchmark to these languages, and further align the translated texts with topic labels based on SIB-200 classification dataset. Our evaluation shows that current LLMs perform poorly on machine translation for these languages in both zero-and-few shot settings. However, we find the few-shot samples to steadily improve topic classification with more shots.

computational linguistic, large language model, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2511.06531

Country: Africa > Nigeria > Akwa Ibom State (0.26)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.33)

Add feedback

A Multi-Task Benchmark for Abusive Language Detection in Low-Resource Settings

Gaim, Fitsum, Song, Hoyun, Lee, Huije, Ko, Changgeon, Hwang, Eui Jun, Park, Jong C.

arXiv.org Artificial IntelligenceOct-28-2025

Content moderation research has recently made significant advances, but remains limited in serving the majority of the world's languages due to the lack of resources, leaving millions of vulnerable users to online hostility. This work presents a large-scale human-annotated multi-task benchmark dataset for abusive language detection in Tigrinya social media with joint annotations for three tasks: abusiveness, sentiment, and topic classification. The dataset comprises 13,717 YouTube comments annotated by nine native speakers, collected from 7,373 videos with a total of over 1.2 billion views across 51 channels. We developed an iterative term clustering approach for effective data selection. Recognizing that around 64% of Tigrinya social media content uses Romanized transliterations rather than native Ge'ez script, our dataset accommodates both writing systems to reflect actual language use. We establish strong baselines across the tasks in the benchmark, while leaving significant challenges for future contributions. Our experiments demonstrate that small fine-tuned models outperform prompted frontier large language models (LLMs) in the low-resource setting, achieving 86.67% F1 in abusiveness detection (7+ points over best LLM), and maintain stronger performance in all other tasks. The benchmark is made public to promote research on online safety.

computational linguistic, large language model, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2505.12116

Country:

Europe (1.00)
North America > United States > Minnesota (0.28)
Asia > Middle East > UAE (0.28)

Genre: Research Report > New Finding (0.93)

Industry: Leisure & Entertainment (0.46)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Language Through a Prism: A Spectral Approach for Multiscale Language Representations Alex T amkin Stanford University Dan Jurafsky Stanford University Noah Goodman Stanford University

Neural Information Processing SystemsOct-2-2025, 17:18:38 GMT

We approach this question by focusing on individual neurons, analyzing the behavior of their activations at different timescales.

artificial intelligence, machine learning, natural language, (20 more...)

Neural Information Processing Systems

Country:

North America > United States > California (0.46)
North America > United States > Colorado (0.28)

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.69)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.69)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)

Add feedback

Design, Implementation and Evaluation of a Novel Programming Language Topic Classification Workflow

Zhang, Michael, Tian, Yuan, Guizani, Mariam

arXiv.org Artificial IntelligenceSep-26-2025

As software systems grow in scale and complexity, understanding the distribution of programming language topics within source code becomes increasingly important for guiding technical decisions, improving onboarding, and informing tooling and education. This paper presents the design, implementation, and evaluation of a novel programming language topic classification workflow. Our approach combines a multi-label Support Vector Machine (SVM) with a sliding window and voting strategy to enable fine-grained localization of core language concepts such as operator overloading, virtual functions, inheritance, and templates. Trained on the IBM Project CodeNet dataset, our model achieves an average F1 score of 0.90 across topics and 0.75 in code-topic highlight. Our findings contribute empirical insights and a reusable pipeline for researchers and practitioners interested in code analysis and data-driven software engineering.

classification, machine learning, programming language, (19 more...)

arXiv.org Artificial Intelligence

2509.20631

Country: North America > Canada (0.15)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Software > Programming Languages (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.86)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.69)

Add feedback

A Rigorous Evaluation of LLM Data Generation Strategies for Low-Resource Languages

Anikina, Tatiana, Cegin, Jan, Simko, Jakub, Ostermann, Simon

arXiv.org Artificial IntelligenceSep-22-2025

Large Language Models (LLMs) are increasingly used to generate synthetic textual data for training smaller specialized models. However, a comparison of various generation strategies for low-resource language settings is lacking. While various prompting strategies have been proposed, such as demonstrations, label-based summaries, and self-revision, their comparative effectiveness remains unclear, especially for low-resource languages. In this paper, we systematically evaluate the performance of these generation strategies and their combinations across 11 typologically diverse languages, including several extremely low-resource ones. Using three NLP tasks and four open-source LLMs, we assess downstream model performance on generated versus gold-standard data. Our results show that strategic combinations of generation methods, particularly target-language demonstrations with LLM-based revisions, yield strong performance, narrowing the gap with real data to as little as 5% in some settings. We also find that smart prompting techniques can reduce the advantage of larger LLMs, highlighting efficient generation strategies for synthetic data generation in low-resource scenarios with smaller models.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2506.12158

Country:

Europe (1.00)
Asia (0.68)
North America > United States > Minnesota (0.28)

Genre: Research Report > New Finding (1.00)

Industry:

Government (1.00)
Leisure & Entertainment (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Add feedback

mSTEB: Massively Multilingual Evaluation of LLMs on Speech and Text Tasks

Beyene, Luel Hagos, Verma, Vivek, Ma, Min, Alabi, Jesujoba O., Schmidt, Fabian David, Nakatumba-Nabende, Joyce, Adelani, David Ifeoluwa

arXiv.org Artificial IntelligenceAug-28-2025

--Large Language models (LLMs) have demonstrated impressive performance on a wide range of tasks, including in multimodal settings such as speech. However, their evaluation is often limited to English and a few high-resource languages. For low-resource languages, there is no standardized evaluation benchmark. In this paper, we address this gap by introducing mSTEB, a new benchmark to evaluate the performance of LLMs on a wide range of tasks covering language identification, text classification, question answering, and translation tasks on both speech and text modalities. We evaluated the performance of leading LLMs such as Gemini 2.0 Flash and GPT -4o and state-of-the-art open models such as Qwen 2 Audio and Gemma 3 27B. Our evaluation shows a wide gap in performance between high-resource and low-resource languages, especially for languages spoken in Africa and Americas/Oceania. Our findings show that more investment is needed to address their under-representation in LLMs coverage. Recently, Large language models (LLMs) have demonstrated impressive performance on various tasks [1]-[3], including in multimodal settings such as vision and speech [4]- [8].

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2506.084

Country:

Europe (1.00)
Asia (1.00)
North America > United States (0.46)
North America > Canada > Quebec (0.28)

Genre: Research Report > New Finding (0.86)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback